Nomilo Fishpond Biogeochemical Analysis
  • Correlational Analysis
  • Fieldwork Templates

Data Analysis Workflow:

  • Install Packages
  • Load Libraries
  • Import Raw Data
    • Procedure
    • View Raw Data
  • Tidy Raw Data
    • Tidying Processes
    • Merge Tidied Datasets
    • Export Tidied Datasets
  • Data Dictionary
  • Exploratory Data Analysis
  • Correlational Analysis

Nomilo Fishpond Biogeochemical Analysis

Interactive Code

Throughout this document, hover over the numbered annotations to the right of code chunks to reveal detailed explanations and comments about the code. Where italicized drop-down text is present, click the arrow to expand it and view the code.

Install Packages

install.packages(c("rio", "tidyverse", "janitor", "lubridate", "rmarkdown", "fs"))

Load Libraries

library(rio)
library(tidyverse)
library(janitor)
library(lubridate)
library(rmarkdown)
library(fs)
1. For importing Excel data
2. For data wrangling and cleaning
3. For cleaning variable names
4. For working with dates
5. For displaying tables
6. For file path handling

Import Raw Data

Procedure

Define vector of files to import:

files_to_import <- dir_ls(path = "data/raw")

for (i in seq_along(files_to_import)) {
  cat(i, "= ", files_to_import[i], "\n")
}
1. Store the file paths of our raw data within the data/raw directory in files_to_import
2. Print each file path with its index
1 =  data/raw/2024-02-28_dfs.RData 
2 =  data/raw/2024-02-28_ksf-clam-growth.xlsx 
3 =  data/raw/2024-02-28_ksf-compiled-data.xlsx 
4 =  data/raw/2024-02-28_ksf-oyster-cylinder-growth.xlsx 
5 =  data/raw/2024-02-28_profile-data.xlsx 
6 =  data/raw/2024-02-28_water-samples.xlsx 
7 =  data/raw/2024-02-28_weather-data.xlsx 
8 =  data/raw/2024-03-01_dfs-no-profiles.RData 
9 =  data/raw/2024-03-01_dfs_no_profiles.RData 
10 =  data/raw/~$2024-02-28_profile-data.xlsx 
11 =  data/raw/~$2024-02-28_water-samples.xlsx 

Use the purrr::map() function to iteratively import files in the files_to_import vector except for the profiles data and .RData files:

@iteratively-import-raw-data Code Chunk Execution Warning

The @iteratively-import-raw-data code chunk should be run only when the raw data is updated, because it takes a long time to execute. Otherwise, run the @efficiently-load-raw-data code chunk to load the most recent saved copy of the raw data.

dfs_no_profiles <- map(files_to_import[c(2:4, 6, 7)], import_list)
current_date <- format(Sys.Date(), "%Y-%m-%d")
save(dfs_no_profiles, file = paste0("data/raw/", current_date, "_dfs-no-profiles.RData"))
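One hedged pattern (a sketch, not part of the original workflow) is to guard the slow import so it only runs when no saved snapshot exists yet; `fs::dir_ls()` with a glob can check for earlier `.RData` snapshots:

```r
library(fs)

# Hypothetical guard: skip the slow import when a saved snapshot already exists
snapshot_files <- dir_ls("data/raw", glob = "*dfs-no-profiles.RData")

if (length(snapshot_files) == 0) {
  message("No saved snapshot found; run the @iteratively-import-raw-data chunk first")
} else {
  # Load the last (most recent, given date-prefixed names) snapshot
  load(snapshot_files[length(snapshot_files)])
}
```

Because the snapshot files are date-prefixed, alphabetical order from `dir_ls()` doubles as chronological order here.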

Efficiently import up-to-date raw data:

load(files_to_import[8])

Rename datasets:

We will always use snakecase when naming our data objects and functions (e.g., data_object_name or function_name()).

names(dfs_no_profiles) <- gsub("data/raw/2024-02-28_|\\.xlsx$|\\.xls$", "", 
                               files_to_import[c(2:4, 6, 7)])
names(dfs_no_profiles) <- gsub("-", "_", names(dfs_no_profiles))
names(dfs_no_profiles)
1. Remove prefixes and file extensions
2. Replace hyphens with underscores
3. Check that the names were output correctly
[1] "ksf_clam_growth"            "ksf_compiled_data"         
[3] "ksf_oyster_cylinder_growth" "water_samples"             
[5] "weather_data"              

Rename each sheet within each raw dataset to lowercase, replacing spaces with underscores:

dfs_no_profiles <- map(dfs_no_profiles, ~ set_names(.x, gsub(" ", "_", tolower(names(.x)))))

Create separate datasets by specifying the Excel sheet from each spreadsheet we want to tidy:

ksf_clam_growth_data <- dfs_no_profiles$ksf_clam_growth$sheet1
ksf_compiled_data <- dfs_no_profiles$ksf_compiled_data$full_data
ksf_oyster_cylinder_growth_data <- dfs_no_profiles$ksf_oyster_cylinder_growth$sheet1
water_samples_data <- dfs_no_profiles$water_samples$data_overview
weather_data <- dfs_no_profiles$weather_data$weather_ksf

We want to combine multiple sheets within the profiles Excel spreadsheet into one; therefore, we import it separately:

sheets_to_import <- c("L1", "L2", "L3", "L4")

profiles_data <- map_dfr(sheets_to_import, function(sheet_name) {
  import(files_to_import[5], which = sheet_name)
})
1. Iterate over each sheet name in sheets_to_import
2. Import the corresponding sheet from the profiles spreadsheet
3. Combine the imported sheets into a single data frame
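As an aside, rio can also do this in a single call; this sketch relies on `import_list()`'s `which` and `rbind` arguments and has not been tested against this workbook:

```r
# Possible one-liner alternative: import the listed sheets and row-bind them
profiles_data <- import_list(files_to_import[5],
                             which = sheets_to_import,  # "L1" through "L4"
                             rbind = TRUE)              # stack sheets into one data frame
```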

View Raw Data

  • ksf_clam_growth_data
  • ksf_compiled_data
  • ksf_oyster_cylinder_growth_data
  • water_samples_data
  • weather_data
  • profiles_data

Tidy Raw Data

Tidying Processes

  • ksf_clam_growth_data_tidied
  • ksf_compiled_data_tidied
  • ksf_oyster_cylinder_growth_data_tidied
  • water_samples_data_tidied
  • weather_data_tidied
  • profiles_data_tidied
Steps to clean data
new_var_names <- c(
  "sort_date", "color", "clams_in_count", "clams_in_lbs", "clams_in_avg_per_lb", 
  "clams_out_count", "clams_out_lbs", "clams_out_avg_per_lb", "growth_in_lbs",
  "growth_pct", "sr", "days_btwn_sort"
  )

new_date_col <- c(
  "2023-10-17", "2023-12-06", "2023-12-12", "2024-01-02", "2024-01-10", "2024-01-24", 
  "2024-01-31", "2024-02-08", "2024-02-13"
  ) 


ksf_clam_growth_data_tidied <- ksf_clam_growth_data %>%
  slice(-1) %>%
  setNames(new_var_names) %>%
  mutate(date = as.Date(new_date_col)) %>% 
  dplyr::select(-sort_date) %>% 
  pivot_longer(
    cols = c(
      clams_in_count, clams_in_lbs, clams_in_avg_per_lb, clams_out_count, 
      clams_out_lbs, clams_out_avg_per_lb
      ),
    names_to = c("stage", ".value"),
    names_prefix = "clams_", 
    names_sep = "_"
  ) %>%
  mutate(stage = if_else(str_detect(stage, "in"), "In", "Out")) %>%
  rename(avg_per_lbs = avg) %>% 
  mutate(across(c(color, stage), as.factor)) %>%
  mutate(across(c(count, lbs, avg_per_lbs, growth_in_lbs, growth_pct, sr), 
                ~as.numeric(gsub("%", "", .)))) %>% 
  arrange(date, color, stage) %>% 
  dplyr::select(date, days_btwn_sort, color, stage, count, lbs, avg_per_lbs, 
                growth_in_lbs, growth_pct, sr)

paged_table(ksf_clam_growth_data_tidied)
Steps to clean data
ksf_compiled_data_tidied <- ksf_compiled_data %>% 
  rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
  janitor::clean_names() %>%
  rename(date = date_time) %>%
  mutate(date = as.Date(date)) %>%
  filter(date >= as.Date("2023-11-20") & date <= as.Date("2024-02-20")) %>%
  arrange(date) %>%
  dplyr::select(-c(external_voltage, wk_num, wind_dir,
                   spadd, outdoor_temperature, hourly_rain,
                   solar_radiation, resistivity, battery_capacity,
                   hour, daynum, data_pt, wind_sp, diradd,
                   wind_speed, wind_direction, tide, day, month, year)
                ) %>%
  dplyr::select(where(~ !anyNA(.))) %>%
  group_by(date) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

paged_table(ksf_compiled_data_tidied)
1. Clean variable names by removing everything in parentheses and using lowercase with underscores in place of spaces
2. Rename the date_time variable to date, filter to the desired date range, and sort by date
3. Remove unnecessary variables
4. Remove columns that contain any NA values
5. Group by date and calculate the daily average of every numeric variable
Steps to clean data
oyster_var_names <- c(
  "date", "oyster_large_weight", "oyster_large_gain", "oyster_small_weight",
  "oyster_small_gain", "oyster_chlorophyll"
  )

ksf_oyster_cylinder_growth_data_tidied <- ksf_oyster_cylinder_growth_data %>% 
  dplyr::select(c(1, 4, 5, 8, 9, 12)) %>%
  slice(-1) %>%
  setNames(oyster_var_names) %>%
  pivot_longer(
    cols = c(oyster_large_weight, oyster_large_gain,
             oyster_small_gain,
             oyster_small_weight),
    names_to = c("oyster_size", ".value"),
    names_prefix = "oyster_",
    names_sep = "_"
  ) %>%
  mutate(oyster_size = if_else(str_detect(oyster_size, "small"), "Small", "Large")) %>%
  mutate(oyster_size = as.factor(oyster_size),
         across(c(weight, gain), as.numeric)
        ) %>%
  filter(date >= as.Date("2023-11-20") & date <=
           as.Date("2024-02-14"))

paged_table(ksf_oyster_cylinder_growth_data_tidied)
1. Manually set variable names
2. Select desired columns and remove the first row
3. Convert from wide to long format
4. Create a new variable that differentiates oyster size
5. Adjust data types to numeric and factor
6. Filter to the desired date range

Address the NA values before merge
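One possible way to start (a sketch, not the project's decided method) is to count the NA values per column and then drop incomplete rows, which is only one option among several:

```r
# Inspect how many NA values each column has before deciding how to handle them
colSums(is.na(ksf_oyster_cylinder_growth_data_tidied))

# One option among several: drop rows with a missing measurement
ksf_oyster_cylinder_growth_data_tidied %>%
  filter(!is.na(weight) & !is.na(gain))
```

Alternatives such as imputation or carrying values forward may be more appropriate depending on why the values are missing.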

Steps to clean data
water_samples_data_tidied <- water_samples_data %>%
  slice(-c(44:52)) %>% 
  rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
  janitor::clean_names() %>%
  mutate(
    date = if_else(date == "44074", 
                   as.character(as.Date("2024-01-09")),
                   format(dmy(date), "%Y-%m-%d")) 
  )

paged_table(water_samples_data_tidied)

Done:

  • Date format
  • Date number conversion

It was not working because the date column was stored as a character type: the serial value 44074 could be converted directly, but we were assuming R knew the other entries were dates. To fix this, we first parse each entry as a date in its original DD-MM-YYYY format, then reformat it to our desired YYYY-MM-DD format.
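A minimal, self-contained illustration of the parse-then-reformat step (not project data):

```r
library(lubridate)

x <- "09-01-2024"              # a date stored as character in DD-MM-YYYY
format(dmy(x), "%Y-%m-%d")     # parse first, then reformat to YYYY-MM-DD
```

Calling `format()` on the raw character string would fail, because `format()` can only reorder components of an object R already recognizes as a date.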

Need to do:

  • Names clean up
  • Add numbers to sample ID

Steps to clean data
weather_data_tidied <- weather_data

paged_table(weather_data_tidied)
Steps to clean data
profiles_data_tidied <- profiles_data

paged_table(profiles_data_tidied)

Merge Tidied Datasets
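This section is still to be written; one hedged sketch, assuming each tidied dataset shares a `date` column converted to a common Date type, is a full join across them:

```r
# A possible approach (assumes a compatible `date` column in every dataset)
merge_candidates <- list(ksf_compiled_data_tidied,
                         water_samples_data_tidied,
                         ksf_oyster_cylinder_growth_data_tidied)

ksf_merged_data <- purrr::reduce(merge_candidates, dplyr::full_join, by = "date")
```

A full join keeps dates present in any dataset; an inner join would instead keep only dates measured in all of them.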

Export Tidied Datasets

Export tidied datasets to CSV into data/tidied folder:

source("code/functions/export_to_csv.R")

dfs_to_export <- list(
  ksf_clam_growth_data_tidied = ksf_clam_growth_data_tidied,
  ksf_compiled_data_tidied = ksf_compiled_data_tidied,
  ksf_oyster_cylinder_growth_data_tidied = ksf_oyster_cylinder_growth_data_tidied
)

imap(dfs_to_export, ~ export_to_csv(.x, .y, "data/tidied"))
1. List of data frames we want to export as CSV files
2. Iterate the export_to_csv(df, df_name, dir_path) function over each data frame: .x refers to the data frame, .y refers to its name, and both are passed to export_to_csv() along with the desired directory path
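To see how imap() supplies both pieces, here is a minimal, self-contained illustration (not project code):

```r
library(purrr)

named_list <- list(alpha = 1, beta = 2)

# .y is each element's name, .x is its value
imap(named_list, ~ paste0(.y, " holds ", .x))
```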

Export merged final data set into data/outputs folder.
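As a sketch, the same helper could write the merged dataset once it exists; the object name `ksf_merged_data` is hypothetical here:

```r
# Hypothetical: `ksf_merged_data` stands in for the final merged dataset
export_to_csv(ksf_merged_data, "ksf_merged_data", "data/outputs")
```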

Data Dictionary

Exploratory Data Analysis

Correlational Analysis
